An Analysis for the Suicides Across the World

Abstract

In this study, the suicide rates of several countries around the worlwide are examined through some of the features like GDP of the countries, suicide numbers by generations and suicide numbers by sex. The analysis is done using several data visualizations and multiple linear regression model. At the end, a predictive model is constructed to be able to predict suicide rates by new data. The results showed that the generation, gender and the continent of a person affect that individual’s probability of committing suicide significantly.

Exploratory Data Analysis

Dataset

The dataset chosen for this study is related with the Suicide Rates across the countries worlwide. This dataset can be accessed here. The data has a total of 101 distinct countries and there are multiple records for each country that show suicides per year, per gender and per age information. In total, there are 27820 observations of 12 features and some of the pertinent features are number of suicides, sex, age, year, population of the countries, gdp and gdp per capita of the countries, Human Development Index (HDI) value for the countries and the generations which people belong to. From these fields, some of the ones that are thought to be redundant are removed and some of the names of the fields are changed. Furthermore, a different column named “continent” is added that shows the respective continents of the countries which will be used in the study. After such data preprocessing tasks are applied, the resulting dataset has 9 features which can be seen below.

Visualizations

Before starting on the statistical analysis, several visualizations are constructed to have a better understanding of the data in order to have a preliminary idea for some of the questions that may arise from that data. The current dataset carries several variables that a researcher might be interested in the possible relationships or correlations between them, such as continent, sex, suicides, gdp per capita and Generation. Several graphics are created to explore such possible relationships, namely box-plots, bar charts and scatter plots. These visualizations are given below.

The first visualizations at hand are created to explore the trends about suicide numbers worlwide. The first plot below is a bar chart that shows the total suicide numbers by continents over the whole dataset. It can be seen that the most of the suicides happen in Europe, Americas and Asia. In the second plot, the suicide numbers by continents can be seen by their values over the time frame of the dataset which is between 1985 and 2016. The trends show that the suicide numbers vary between continents. For instance, while Europe and Asia seem to have a great variation in their suicide numbers over the years (with several increases and descreases), the changes in suicide numbers in Americas seem to be much less.

The next plot is a bar plot that also shows the suicide numbers by continent, but this time, including the differentiation by the gender. From the chart, it can be seen that in every country, males commit suicide significantly more than the females.

The other suicide trends plot can be constructed using the differentiation by the generations. The faceted grid graph below shows the suicidal rates (per 1 million people) change over the years by generation. It can be seen that the highest suicidal rates belong to the oldest generation (G.I. Generation) and the lowest suicidal rates belong to the newest generation (Generation Z) and all of the other generations are conversely-ordered as well.

It is another point of interest that there might be a relationship with the economic strength of a country and the suicide rates of that country. To do this, the dataset is aggregated and several income levels are defined. These levels indicate countries with per capita income levels such as very low, low, average, high or very high. The break levels for the per capita income groups are defined arbitrarily as 10000,20000,30000 and 40000.

In the plot below, the y axis shows the number of suicides for a country and x axis shows the per capita income values of that country with the defined economic strength levels. It can be seen that the black smoothing line shows an increase in suicides while the gdp per capita numbers increase from “low” to “high” income, but then the number of suicides start to decrease as the per capita numbers approach from “high” to “very high” income interval.

Scatter plot showing relationships between suicide numbers - per capita income

Methods and Results

In this section, the variables like generation, continent, gdp_capita and sex will be used as independent possible predictor variables and the variable suicides will be used as the dependent possible response variable. Since the outcome of interest variable, (number of) suicides, is a numeric variable, the statistical model that will be used is the multiple linear regression model.

The initial regression model will be constructed using “suicides” as the response variable and other variables except “country” and “year” as the possible predictor variables. This initial model and the result of it are as follows:

  Estimate Std. Error t value Pr(>|t|)
(Intercept) -176.1 25.64 -6.866 6.765e-12
sexmale 273.8 8.201 33.38 1.431e-239
population 0.000146 1.554e-06 93.96 0
gdp -6.466e-12 4.362e-12 -1.482 0.1383
gdp_capita -0.0003899 0.0002544 -1.532 0.1254
generationG.I. Generation 8.384 16.55 0.5065 0.6125
generationGeneration X -134.2 12.94 -10.37 3.666e-25
generationGeneration Z -302.2 20.49 -14.74 5.115e-49
generationMillenials -227.3 13.27 -17.12 2.098e-65
generationSilent 1.95 13.08 0.1491 0.8815
continentAmericas 1.905 24.57 0.07753 0.9382
continentAsia 139.1 25.34 5.488 4.108e-08
continentEurope 203.1 24.67 8.232 1.925e-16
continentOceania 126.2 32.31 3.907 9.356e-05
Fitting linear model: suicides ~ . - country - year
Observations Residual Std. Error \(R^2\) Adjusted \(R^2\)
27820 683.9 0.4255 0.4252

From the summary of the initial regression model above, it can be seen that R automatically creates dummy variables for the categorical variables. In this case, the dummy variable for the sex variable is sexMale (or base variable as sexFemale), the base variable for generation is generationBoomer and the base variable for the continent is continentAfrica.

The first thing to explore about this model is to decide whether at least one of the independent variables is significant in terms of predicting the outcome, which means that the null hypothesis ( \(H_0 : \beta_1 = \beta_2 = ... = \beta_n = 0\) , where \(\beta_{1,2,...,n}\) are the coefficient estimates for the variables) should be rejected in favor of the alternate hypothesis. This result can be seen by the F-Statistic and the corresponding p-value for it. The F-Statistic and the p-value upwards are the results of Analysis of Variance (ANOVA) test where the model with several predictor variables are compared against the model with zero predictor variable. The result of this test that is given below indicates that since the specific p-value is considerably less than 0, it is said that the null hypothesis is rejected in the favor of alternate hypothesis which means there is at least one predictor variable that is statistically significant (useful).

Analysis of Variance Table
Res.Df RSS Df Sum of Sq F Pr(>F)
27819 2.264e+10 NA NA NA NA
27806 1.3e+10 13 9.632e+09 1584 0

Furthermore, the next question of interest is to decide which variables may be useful for the regression model. This means that how much and to what extent the independent variables are correlated with the response variable. The strength of the relationship between the dependent variable suicides and other independent variables can be seen via p-values from the results. We can infer from the p-values given below that the variables sex, population, dummy variables for generations X,Z,Millenials and dummy variables for continents Asia, Europe and Oceania are statistically significant variables.

  Estimate Pr(>|t|)
(Intercept) -176.1 6.765e-12
sexmale 273.8 1.431e-239
population 0.000146 0
gdp -6.466e-12 0.1383
gdp_capita -0.0003899 0.1254
generationG.I. Generation 8.384 0.6125
generationGeneration X -134.2 3.666e-25
generationGeneration Z -302.2 5.115e-49
generationMillenials -227.3 2.098e-65
generationSilent 1.95 0.8815
continentAmericas 1.905 0.9382
continentAsia 139.1 4.108e-08
continentEurope 203.1 1.925e-16
continentOceania 126.2 9.356e-05

The magnitudes of the relationships between significant variables and the number of suicides are related with how much of a 1 unit change in these variables would effect the suicide number which can be seen by the coefficients of these variables. For instance, the coefficient of the dummy variable “sexMale” is 273.76, which means that while controlling for other variables like generation and continent, the suicide numbers for a male is 273.76 more than a female on average. Moreover, keeping (controlling) other variables constant, the suicide numbers for the Europe is 203.1 more than the Africa on average.

Another property to look for in the results of the regression model is the goodness of fit of the model to the dataset. The measures for the goodness of fit of a model are F-Statistic, Residual Standard Error and Adjusted \(R^2\) Value. These values are 1584.2, 683.9 and 0.425 respectively. A rule of thumb approach says that Adjusted \(R^2\) value below than 0.3 means a bad model fit, more than 0.3 but less than 0.7 means an average fit and more than 0.7 means a good fit. Thus, we may say the model fits averagely to the data and it is not a good model.

From the results table, it is seen that some of the variables like “gdp” and “gdp_capita” have high p-values (more than 0.01 or 0.05 thresholds) like 0.138 and 0.125. These values imply that these variables are not significant in this model. This phenomenon might happen due to the collinearity between these variables or other variables carrying out the “information” that these variables carry. Thus, to further investigate which of these variables might be excluded or included into the final model, the process named feature selection should be done. There are several feature selection models in both statistics and detailed ones in Machine Learning. One of the most commonly used functions in R software for feature selection is step function. This function selects features by adding (forward), subtracting (backward) or both adding and subtracting features 1 by 1 to acquire the best possible performance for the model. The performance measure for the models are Akaike Information Criterion (AIC) and Bayesian Information Criterion (BIC). The result of the feature selection with these performance measures are given below.

df BIC
aicModel 15 442294.1
bicModel 13 442280.8
  Estimate Std. Error t value Pr(>|t|)
(Intercept) -174.8 25.5 -6.856 7.236e-12
sexmale 273.6 8.201 33.36 2.836e-239
population 0.0001441 1.063e-06 135.5 0
generationG.I. Generation 7.552 16.39 0.4609 0.6449
generationGeneration X -135.1 12.93 -10.45 1.567e-25
generationGeneration Z -309.2 20.33 -15.21 4.66e-52
generationMillenials -230.5 13.21 -17.44 8.891e-68
generationSilent -1.217 13 -0.09363 0.9254
continentAmericas -0.1538 24.56 -0.006261 0.995
continentAsia 134.9 25.26 5.341 9.297e-08
continentEurope 195.2 24.33 8.025 1.056e-15
continentOceania 118.6 32.13 3.69 0.0002244
Fitting linear model: suicides ~ sex + population + generation + continent
Observations Residual Std. Error \(R^2\) Adjusted \(R^2\)
27820 683.9 0.4254 0.4251

It is found from using both of the measures that the best model is created by subtracting both the gdp and gdp_capita variables and leaving the rest. However, the increase in goodness of fit was slight since only the F-Statistic increased significantly to 1871.3 and the adjusted \(r^2\) value stayed almost the same compared to initial model with a value of 0.4252.

Predictions

Using the best available model found, the next step for the regression model analysis is to check how great the model’s ability to predict outcome variable really is. There are 2 different approaches for the model data and prediction data to be used in this study. The first approach is to divide the model into a 2 parts such that, for instance, 70% of the data will be in the model training dataset and the remaining 30% of the data will be in the prediction or test dataset. The second approach will be to use one of the most common cross-validation techniques, namely k-fold cross-validation.

The first model using training and test dataset technique is given below.

  Estimate Std. Error t value Pr(>|t|)
(Intercept) -165.1 31.64 -5.217 1.839e-07
sexmale 273.9 10.13 27.03 5.005e-158
population 0.0001453 1.3e-06 111.8 0
generationG.I. Generation 8.359 20.21 0.4137 0.6791
generationGeneration X -142.1 15.95 -8.907 5.688e-19
generationGeneration Z -327.4 24.96 -13.12 3.893e-39
generationMillenials -235.8 16.31 -14.45 4.11e-47
generationSilent -8.822 16.06 -0.5492 0.5829
continentAmericas -9.405 30.42 -0.3092 0.7572
continentAsia 120.1 31.31 3.835 0.0001257
continentEurope 192.8 30.15 6.395 1.639e-10
continentOceania 109.8 39.8 2.758 0.005816
Fitting linear model: suicides ~ . - country - year - gdp - gdp_capita
Observations Residual Std. Error \(R^2\) Adjusted \(R^2\)
19474 706.8 0.4181 0.4177

After the model is created, the outcome variable “number of suicides” is predicted using the test dataset that is the 30% of the original data. The predictions using test data set is shown below.

Root MSE by Train/Test Approach
627.2308

There are several accuracy methods that can be used for a quantitative outcome like Mean Squared Error(MSE), Root Mean Squared Error (RMSE) or Mean Absolute Error(MAE). Here the RMSE is chosen as the accuracy measure of the predictions. The RMSE value after the model that uses training and test data is 627.23.

The second approach for the prediction modeling is to use k-fold cross validation technique. Simply, the data is divided into k-folds and the model is run k times, each time using k-1 folds as the training data and 1 fold as the data. This approach is generally preferable to simple training/test data divide since it prevents overfitting better. The most common k number is 10 which will also be 10 in this study. The predictions using k-fold cross validation is given below.

intercept RMSE Rsquared MAE RMSESD RsquaredSD MAESD
TRUE 679.1 0.4312 270.1 90.51 0.04823 10.25

It can be sen that the RMSE value after the predictions are done using the 10-fold cross validations has decreased to 679.12 when compared to a single training/test divide. This result is on par with the claim that the k-fold cross validation reduces the prediction error further than the single training/test dataset approach.

Discussion and Future Work

In this study, the suicide rates across different countries of the world are examined according to different features like generation, gender, population, gdp or continent of the country. The features of the data are visualized through several visualization tools like bar plots, scatter plots and line plots. After some intuition about the data is acquired, the features that are thoguht to be relevant for predicting the outcome of number of suicides in a country are put into a multiple linear regression model. The result of the models showed that the generation, sex and the continent a specific person lives in significantly affect the possibility of that person to commit suicide. Such results were that the older generations, males and people living in 1st world countries like the ones in Europe and America have more likelihood of committing suicide.

The results and inferences of this study can be further used and extended in different social science studies like finding the reasons of why the older people commit suicides more or finding the reasons why people commit suicide significantly more in the 1st world than the 3rd world. Results of the such researches might then be used by the government officials to help lessen the suicide numbers in their respective country.

References

[1] H. Schumacher, “Why more men than women die by suicide”, Bbc.com, 2019. [Online]. Available: https://www.bbc.com/future/article/20190313-why-more-men-kill-themselves-than-women. [Accessed: 20- Dec- 2019].

DATA PROCESSES FINAL PROJECT
UPM Facultad de Informatica
Raymond Banks, Houda Briwa, Emre Hangül, Fatemeh Mohammadi, Hugo Nugra

20 12 2019